The ThreeDRNAseq (3D RNA-seq) R package is designed for use by biologists to analyze their own RNA-seq data (Guo et al., 2019). It provides an interactive graphical user interface (GUI) for differential expression (DE), differential alternative splicing (DAS) and differential transcript usage (DTU) (3D) analyses of RNA-seq data based on the popular pipeline limma (Ritchie et al., 2015). It also integrated transcript isoform switch tools, such as IsoKtsp (Sebestyen et al., 2015) and TSIS (Guo et al., 2017) for an enhanced alternative splicing analysis. The 3D RNA-seq removes all the unnecessary complexities associated with differential expression analysis and enables a complete RNA-seq analysis to be done quickly (3 Days, thus 3D). It allows complex experimental designs such as time-series, developmental series and multiple conditions. It employs state-of-the-art methods/statistics and generates 3D results quickly and accurately.
3D analysis pipeline
The 3D RNA-seq App was developed by Dr. Wenbin Guo from research into the analysis of time-series RNA-seq data (Calixto et al., 2018) with help from Dr. Cristiane Calixto and Dr. Nikoleta Tzioutziou and guidance from Prof. John W.S. Brown and Dr. Runxuan Zhang from the University of Dundee - School of Life Sciences and the James Hutton Institute - Information and Computational Sciences. We acknowledge Dr. Iain Milne and Gordon Stephen for technical support. To use our pipeline in your work, please cite:
3D RNA-seq App “easy-to-use” manual:
https://github.com/wyguo/ThreeDRNAseq/blob/master/vignettes/user_manuals/3D_RNA-seq_App_manual.md
In the GUI, users can click tooltips in specific steps for help information.
3D RNA-seq App is developed and maintained by Dr. Wenbin Guo from the Plant Sciences Division, School of Life Sciences, University of Dundee. If you have any questions and suggestions, please contact:
The 3D RNA-seq App docker image is hosted by the James Hutton Institute server. Open the App by this link: https://ics.hutton.ac.uk/3drnaseq
To perform 3D analysis, no installation is required. Users need to upload input data to our server. All results, reports, figures and intermediate data will be zipped and downloaded in the final step.
To run the 3D RNA-seq App through RStudio on a local PC, users do not need to upload the data to our server and all the outputs will be directly saved to the App working directory. Please run the following command to install ThreeDRNAseq R package and the packages of dependencies.
Install dependency packages
#####################################################################################
## Install packages of dependency
###---> Install packages from Cran
cran.package.list <- c("shiny","shinydashboard","rhandsontable","shinyFiles",
"shinyjs","shinyBS","shinyhelper","shinyWidgets",
"magrittr","DT","plotly","ggplot2","eulerr",
"gridExtra","grid","fastcluster","rmarkdown","base64enc",
"ggrepel","zoo","gtools")
for(i in cran.package.list){
if(!(i %in% rownames(installed.packages()))){
message('Installing package: ',i)
install.packages(i,dependencies = T)
} else next
}
###---> Install packages from Bioconductor
bioconductor.package.list <- c('tximport','edgeR','limma','RUVSeq',
'ComplexHeatmap','rhdf5')
for(i in bioconductor.package.list){
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if(!(i %in% rownames(installed.packages()))){
message('Installing package: ',i)
BiocManager::install(i, version = "3.8",dependencies = T)
} else next
}
Install ThreeDRNAseq R package
ThreeDRNAseq R package can be installed from Github by using devtools R package
##################################################################################################
## use devtools R package to install ThreeDRNAseq from Github
###---> If devtools is not installed, please install
if(!requireNamespace("devtools", quietly = TRUE))
install.packages('devtools',dependencies = TRUE)
###---> Install ThreeDRNAseq
if(!requireNamespace("ThreeDRNAseq", quietly = TRUE))
devtools::install_github('wyguo/ThreeDRNAseq')
Run 3D RNA-seq App
library(ThreeDRNAseq)
run3DApp()
The ThreeDRNAseq R package can be used as a normal R package. The vignette of command line for 3D analysis can be found in:
NOTE:
Gather the meta-data of the experimental design in a csv spreadsheet, the columns of which must include the following information (Figure 1A):
A folder that contains the transcript quantification files. Each file contains transcript quantification data of a single sample. Read counts and TPMs for 3D analysis will be generated from the “quant.sf” objects if these files are generated by Salmon (Patro et al., 2017) and the “abundance.tsv” objects if these files are generated by Kallisto (Bray et al., 2016) (Figure 1B).
Transcript-gene mapping file in one of the following formats:
Note:
Transcript-gene mapping in “csv” file is recommended. Depending on the size, it may take a while to generate the table from a “fa” or a “gtf” file and any missing tags for transcript name and gene ID extraction in these files may lead to errors.
The 3D analysis is executable in a computer with normal memory and CPU size. If the App is running on docker image, it is recommended to reduce the data size to upload to our server. Users can exclude all the files in sub-folders of transcript quantifications, except the files of “quant.sf” from Salmon. If the transcript quantifications are generated using Kallisto, users can keep “abundance.tsv” in the sub-folders and remove the other files (Figure 1B).
Figure 1: Input files of 3D RNA-seq App. The example is from a RNA-seq study of Arabidopsis in respond to cold (Calixto et al., 2018). (A) Meta-data table of sample information in csv file. (B) The folder contains transcript quantifications. The input of transcript-gene mapping information can be a “csv” spreadsheet with first column of transcript names and second column of gene IDs (C), a “.fa” file (D) or a “.gtf” file (E) of the transcriptome. If a “.fa” or a “.gtf” file is provided, the App will extract transcript-gene association information with specific tags.
The results of the 3D RNA-seq analysis are saved in the App working direcotry in four folders:
Download link: https://www.dropbox.com/s/8vwuz6u2yl7v9qx/3D%20RNA-seq%20App%20example%20data.zip?dl=0
Description: This example is a sub-dataset from a time-series study of Arabidopsis plants exposed to cold (Calixto et al., 2018). RNA-seq data of 6 time-points were extracted from the whole dataset. The time-points are 3 and 6 hours after dusk at \(20^oC\), the first day of transition to \(4^oC\) and the fourth day of acclimation to \(4^oC\) (red boxes in Figure 2). Each time-point has 3 biological replicates and each biological replicate has 3 sequencing replicates. Transcript quantifications were generated using Salmon (Patro et al., 2017) and AtRTD2-QUASI (Zhang et al., 2017) as the reference transcriptome. To further reduce the data size, the expression of 8,944 transcripts (from 2,000 genes ) were extracted from the Salmon quantification to identify the cold response genes and transcripts at both transcriptional and AS level.
Figure 2: Experimental design of time-series RNA-seq data from study of Arabidopsis cold response. For this example, a subset of samples, genes and transcripts were extracted from the whole dataset to reduce data size.
3D RNA-seq App is made of control widgets that users can interact with and send messages to the underlying R code to perform analysis by simple clicking/dragging of the mouse (Figure 3).
Figure 3: Basic widgets of 3D RNA-seq App.
Running 3D RNA-seq just requires users to follow the steps from Tab panel 1 to 7.
Users can upload input files and experimental design metafile in this tab panel. Transcript level and gene level read counts and TPMs (transcript per million reads) are generated by using the tximport R package (Soneson et al., 2016).
Step 1: Upload input files to 3D RNA-seq App.
Step 2: Select the columns of factors, labels of replicates and quantification file names from meta-data table for 3D analysis.
Step 3: Generate gene and transcript read counts and TPMs.
When the button “Run” is clicked, the process currently running will be shown at the lower right corner of the web browser. Once it is done, you can move to Tab panel 2 using the navigation menu on the left and make sure you roll your page to the very top.
Once the read counts and TPMs are generated, the data will go through a number of pre-processing steps . In each step, quality control plots are generated to optimise the parameters for pre-processing. If the RNA-seq data has sequencing replicates (seq-reps), they will be merged before 3D analysis according to the seq-rep labels selected by users in the meta-data table to increase sequencing depth, because seq-reps are generated by sequencing the same biological replicate multiple times (e.g. on different sequencing lanes), but they do not add much variability to the biological replicates.
Step 4: Filter low expressed transcripts and genes based on expression mean-variance trend.
Step 5: PCA plot and removing batch effects.
The PCA plot can also be used to identify whether the RNA-seq data contains batch effects, which are caused by biological replications being prepared in different, for example, laboratory conditions
In this panel, users can select and visualise different PCs based on transcript level or gene level expression of all samples or the average expression of biological replicates. The scatter points can be grouped and coloured according to different factors. Ellipses or polygons can be added to the plots to highlight the grouped clusters.
Step 6: Data normalisation.
For unbiased comparisons across samples, read counts must be normalised. Normalisation methods such as Trimmed Mean of M-values (TMM), Relative Log Expression (RLE) and upper-quartile can be used to reduce the effect from the systematic technical biases across samples (Bullard et al., 2010). Box plots are used to visualise the expression distribution of raw read counts and normalised expression across samples.
Once the normalization is done as shown at the lower right corner on the browser, please proceed to Tab panel 3.
Step 7: Set contrast groups and perform 3D analysis.
A contrast group is a user-defined comparison between two samples or two groups of samples:
NOTE: if the experimental design involves multiple factor levels, e.g. condition A, B and C are performed in wildtype (WT) and mutant (MU), respectively, these factors will be combined to generate group labels
Step 8: Set statistical parameters and perform 3D analysis
Significant result summary
After the the 3D analysis, the following information is summarized and will appear at the bottom of the page:
These summaries can be filtered, customized and the figures can be saved with specified formats and sizes.
Step 9: Generating heat-maps.
Users can make heat-maps of significant DE genes, DAS genes, DE transcripts and DTU transcripts identified from the analysis. The heat-maps and the gene/transcript list in each cluster of the heat-maps can be saved to local folder.
Step 10: Generating profiles and percent spliced plots.
Gene and transcript expression profiles (TPMs or read counts) and PS (percent spliced) can be visualised by typing a gene id. Users can also generate a number of plots by providing a gene list and all the plots can be saved to “figure” folder of the App working directory.
Step 11: Generating GO enrichment plot.
Users can generate gene lists of DE genes, DAS genes, DE transcripts and DTU transcripts by clicking “Download gene list” button. These gene lists can be uploaded to Gene Ontology (GO) analysis tools/databases (e.g. DAVID and agriGO) to generate GO annotation. A csv file with GO annotation information is required to generate the annotation plot. The file includes a column of “Category” of CC (cellular component), BP (biological process) and MF (molecular function), a column of “Term” of GO annotation and the rest columns of statistics to report the annotation enrichment, e.g. count, -log10(FDR), etc.
Transcript isoform switches (ISs) occur when within a gen a pair of alternatively spliced isoforms reverse the order of their relative expression levels. IsokTSP is a method to detect transcript ISs between pair-wise conditions (Sebestyen et al., 2015) while TSIS (time-series IS) is used to identify ISs in time-series data (Guo et al., 2017).
Step 12: Perform isoform switch analysis.
After the analysis is done, a number of plots will be generated automatically to visualize the results.
Step 12: generate report and download all the results.
Publication-quality reports will generated and together with figures, results and statistics of significance can be downloaded in a zipped file.
Reports are saved in report folder.
| File name | Description |
|---|---|
| 3D_report.pdf/html/doc | Report of 3D analysis in pdf, html and doc format |
| File names (alphabetical) | Description |
|---|---|
| DAS genes euler plot across contrast*.pdf/.png | Euler diagram to compare DAS genes in different contrast groups |
| DAS genes GO annotation plot.pdf/.png | DAS genes GO annotation plot |
| DE genes euler plot across contrast*.pdf/.png | Euler diagram to compare DE genes in different contrast groups |
| DE genes GO annotation plot.pdf/.png | DE genes GO annotation plot |
| DE genes up and down regulation numbers.pdf/.png | DE genes up and down regulation numbers |
| DE transcripts euler plot across contrast*.pdf/.png | Euler diagram to compare DE transcripts in different contrast groups |
| DE transcripts up and down regulation numbers.pdf/.png | DE transcripts up and down regulation numbers |
| DE vs DAS gene euler plot in contrast*.pdf/.png | Euler diagram to compare DE and DAS genes in different contrast groups |
| DE vs DTU transcript euler plot in contrast*.pdf/.png | Euler diagram to compare DE and DTU transcripts in different contrast groups |
| DTU transcripts euler plot across contrast*.pdf/.png | Euler diagram to compare DTU transcripts in different contrast groups |
| Gene expression distribution.pdf/.png | Gene expression distribution |
| Gene mean-variance trend.pdf/.png | Gene mean-variance trend plot |
| Gene PCA all samples*.pdf/.png | Gene PCA plot of all samples |
| Gene PCA average expression.pdf/.png | Gene PCA plot of average expression |
| Gene PCA batch effect removed all samples*.pdf/.png | Gene PCA plot of all samples after removing batch effects |
| Heatmap DAS genes.pdf/.png | Heat-map of DAS genes |
| Heatmap DE genes.pdf/.png | Heat-map of DE genes |
| Heatmap DE transcripts.pdf/.png | Heat-map of DE transcripts |
| Heatmap DTU transcripts.pdf/.png | Heat-map of DTU transcripts |
| Isoform switch number.png/.pdf | Number of significant isoform switch numbers |
| Profile/Abundance/PS plots | Folders contain gene/transcript profile plots |
| Transcript expression distribution.pdf/.png | Transcript expression distribution |
| Transcript mean-variance trend.pdf/.png | Transcript mean-variance trend plot |
| Transcript PCA all samples*.pdf/.png | Transcript PCA plot of all samples |
| Transcript PCA average expression.pdf/.png | Transcript PCA plot of average expression |
| Transcript PCA batch effect removed all samples*.pdf/.png | Transcript PCA plot of all samples after removing batch effects |
| Union set DE genes vs DAS genes.pdf/.png | Flow chart -Union set DE genes vs DAS genes |
| Union set DE transcripts vs DTU transcripts.pdf/.png | Flow chart -Union set DE transcripts vs DTU transcripts |
Important results are saved in csv (comma delimited) files.
| File names (alphabetical) | Description |
|---|---|
| contrast.csv | Contrast groups used for 3D analysis. |
| DDD genes and transcript lists across all contrast groups.csv | List of DE genes, DAS genes, DE transcripts and DTU transcripts, which are the union sets across all contrast groups. |
| DDD numbers.csv | DE/DAS/DTU genes/transcript numbers in each contrast group. |
| DEvsDAS results.csv | Number of DE vs DAS genes. |
| DEvsDTU results.csv | Number of DE vs DTU transcripts. |
| Gene read counts.csv | Raw read counts of genes before data pre-processing. |
| Gene TPM.csv | Raw TPM of genes before data pre-processing. |
| Raw isoform switch scores.csv | Statistics of all possible isoform switches. |
| RNAseq info.csv | RNA-seq data information before and after pre-processing. |
| samples.csv | Meta-data table of sample information. |
| Significant DAS and not DAS genes list and statistics.csv | DAS gene test statistics of all genes. |
| Significant DAS genes list and statistics.csv | DAS gene test statistics of significant genes. Not DAS genes are removed from the table. |
| Significant DE and not DE genes list and statistics.csv | DE gene test statistics of all genes. |
| Significant DE and not DE transcripts list and statistics.csv | DE transcript test statistics of all transcripts. |
| Significant DE genes list and statistics.csv | DE gene test statistics of significant genes. Not DE genes are removed from the table. |
| Significant DE transcripts list and statistics.csv | DE transcript test statistics of significant transcripts. Not DE transcripts are removed from the table. |
| Significant DTU and not DTU transcripts list and statistics.csv | DTU transcript test statistics of all transcripts. |
| Significant DTU transcripts list and statistics.csv | DTU transcript test statistics of significant transcripts. Not DTU transcripts are removed from the table. |
| Significant isoform switch scores.csv | Statistics of significant isoform switches. |
| Target in each cluster heatmap *(DAS only are filtered).csv | DE gene lists in clusters of DE gene heatmap. |
| Target in each cluster heatmap *DE genes.csv | DE+DAS gene lists in clusters of DAS gene heatmap. The DAS only genes are excluded since they have no significant abundance changes across samples. |
| Target in each cluster heatmap *DE transcripts.csv | DE transcript lists in clusters of DE transcript heatmap |
| Target in each cluster heatmap *transcripts (DTU only are filtered).csv | DE+DTU transcript lists in clusters of DTU transcript heatmap. The DTU only transcripts are excluded since they have no significant abundance changes across samples. |
| Transcript and gene mapping.csv | Transcript-gene mapping table. |
| Transcript read counts.csv | Raw read counts of transcripts before data pre-processing. |
| Transcript TPM.csv | Raw TPM of transcripts before data pre-processing. |
Intermediate data in .RData for 3D RAN-seq analysis are saved in the data folder. There are three .RData objects: 1) txi_trans.RData and 2) txi_genes.RData are transcript and gene level read count and TPM outputs from the tximport R package (Soneson et al., 2016). All the intermediate data generated in the process of 3D analysis is saved in the list object intermediate_data.RData. R users can access to the data using command line.
| List object | Elements in list object | Element type | Description |
|---|---|---|---|
| intermediate_data.RData | conditions | character | Labels of conditions to study |
| contrast | character | Contrast groups | |
| DAS_genes | data.frame | Statistics of significant DTU transcripts | |
| DDD_numbers | data.frame | Number of DE/DAS/DTU genes/transcripts in contrast groups | |
| DE_genes | data.frame | Statistics of significant DE genes | |
| DE_trans | data.frame | Statistics of significant DE transcripts | |
| deltaPS | data.frame | Delta PS based on contrast groups | |
| DEvsDAS_results | data.frame | Number of DE vs DAS genes | |
| DEvsDTU_results | data.frame | Number of DE vs DTU transcripts | |
| DTU_trans | data.frame | Statistics of significant DTU transcripts | |
| genes_3D_stat | list | All the raw results of linear regression and statistics of DE genes | |
| genes_batch | list | Estimated gene level batch effects, if they exist. 1) W: matrix, estimated batch effect term, which can be added to design matrix of linear regression; 2) normalizedCounts: matrix, read counts where batch effects are removed; 3) method: a string, method used to estimate batch effects. | |
| genes_counts | data.frame | Read counts of genes. Seq-reps are merged if exist. | |
| genes_log2FC | matrix | log2-CPM of genes | |
| genes_TPM | matrix | TPMs of genes | |
| mapping | data.frame | Transcript-gene mapping | |
| params_list | list | Parameters used for the 3D analysis | |
| PS | matrix | Percent spliced (PS) of expressed transcripts | |
| RNAseq_info | data.frame | RNA-seq data information before and after pre-processing | |
| samples | data.frame | Sample information. | |
| samples_new | data.frame | Sample information after merging sequencing replicates (seq-reps, if exist). | |
| scores | data.frame | Statistics of isoform switches | |
| scores_filtered | data.frame | Statistics of significant isoform switches | |
| target_high | list | 1) trans_high: character, expressed transcripts; 2) genes_high: character, expressed genes; 3) mapping_high: data.frame, expressed transcript-gene mapping | |
| trans_3D_stat | list | All the raw results of linear regression and statistics of DAS genes, DE and DTU transcripts | |
| trans_batch | list | Estimated transcript level batch effects, if they exist. 1) W: matrix, estimated batch effect term, which can be added to design matrix of linear regression; 2) normalizedCounts: matrix, read counts where batch effects are removed; 3) method: string, method used to estimate batch effects. | |
| trans_counts | data.frame | Read counts of transcripts. Seq-reps are merged if exist. | |
| trans_log2FC | matrix | log2-CPM of transcripts. | |
| trans_TPM | matrix | TPMs of transcripts. | |
| Other elements | The list object may include other elements. | ||
| txi_genes.Rdata and txi_trans.Rdata | abundance | matrix | TPMs of genes/transcripts |
| counts | matrix | Read counts of genes/transcripts | |
| countsFromAbundance | character | Method used to generate read counts and TPMs | |
| length | matrix | Length of genes/transcripts |
Benjamini,Y. and Yekutieli,D. (2001) The control of the false discovery rate in multiple testing under dependency. Ann. Stat., 29, 1165–1188.
Bray,N.L., Pimentel,H., Melsted,P., and Pachter,L. (2016) Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol., 34, 525–527.
Bullard,J.H., Purdom,E., Hansen,K.D., and Dudoit,S. (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11, 94.
Calixto,C.P.G., Guo,W., James,A.B., Tzioutziou,N.A., Entizne,J.C., Panter,P.E., Knight,H., Nimmo,H., Zhang,R., and Brown,J.W.S. (2018) Rapid and dynamic alternative splicing impacts the Arabidopsis cold response transcriptome. Plant Cell, tpc.00177.2018.
Gu,Z., Eils,R., and Schlesner,M. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics, 32, 2847–2849.
Guo,W., Calixto,C.P.G., Brown,J.W.S., and Zhang,R. (2017) TSIS: An R package to infer alternative splicing isoform switches for time-series data. Bioinformatics, 33, 3308–3310.
Guo,W., Tzioutziou,N., Stephen,G., Milne,I., Calixto,C., Waugh,R., Brown,J.W., and Zhang,R. (2019) 3D RNA-seq - a powerful and flexible tool for rapid and accurate differential expression and alternative splicing analysis of RNA-seq data for biologists. bioRxiv, 656686. doi: https://doi.org/10.1101/656686.
Law,C.W., Chen,Y., Shi,W., and Smyth,G.K. (2014) voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 15, R29.
Patro,R., Duggal,G., Love,M.I., Irizarry,R.A., and Kingsford,C. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods, 14, 417–419.
Risso,D., Ngai,J., Speed,T.P., and Dudoit,S. (2014) Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol., 32, 896–902.
Ritchie,M.E., Phipson,B., Wu,D., Hu,Y., Law,C.W., Shi,W., and Smyth,G.K. (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res, 43, e47.
Rokach,L. and Maimon,O. (2005) Clustering Methods. In, Data Mining and Knowledge Discovery Handbook. Springer-Verlag, New York, pp. 321–352.
Saracli,S., Dogan,N., and Dogan,I. (2013) Comparison of hierarchical cluster analysis methods by cophenetic correlation. J. Inequalities Appl.
Soneson,C., Love,M.I., and Robinson,M.D. (2016) Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research, 4, 1521.
Zhang,R., Calixto,C.P.G., Marquez,Y., Venhuizen,P., Tzioutziou,N.A., Guo,W., Spensley,M., Entizne,J.C., Lewandowska,D., Have,S. Ten, Frey,N.F., Hirt,H., James,A.B., Nimmo,H.G., Barta,A., Kalyna,M., and Brown,J.W.S. (2017) A high quality Arabidopsis transcriptome for accurate transcript-level analysis of alternative splicing. Nucleic Acids Res., 45, 5061-5073.
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.1 magrittr_1.5 tools_3.5.1 htmltools_0.3.6
## [5] yaml_2.2.0 Rcpp_1.0.2 stringi_1.4.3 rmarkdown_1.15
## [9] knitr_1.25 stringr_1.4.0 xfun_0.9 digest_0.6.20
## [13] evaluate_0.14